
fix: CPU OOM issue during LoRA training #41

Merged

xiami2019 merged 1 commit into OpenMOSS:main from SongwuJob:main on Mar 4, 2026
Conversation

@SongwuJob (Contributor)

When fine-tuning with the provided LoRA training script, CPU memory usage grows continuously over time until the process is eventually killed by the system due to out-of-memory (OOM).

The issue is caused by enabling torch.cuda.memory._record_memory_history(enabled="all"), which records CUDA memory allocation events and stores them in host (CPU) memory. As training progresses, the accumulated memory history grows without bound, leading to excessive CPU memory consumption and ultimately the CPU OOM.
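A minimal sketch of the kind of fix described above (the function name and structure are illustrative assumptions, not the actual patch): leave memory-history recording off during normal training, and if it is needed for debugging, cap the retained event count via `max_entries` instead of using an unbounded buffer.

```python
# Sketch (assumption: illustrative helper, not the actual PR diff).
# Disables CUDA memory-history recording by default so the host-side
# event buffer cannot grow for the duration of a long LoRA run.
try:
    import torch
    HAVE_CUDA = torch.cuda.is_available()
except ImportError:  # allow the sketch to run without PyTorch installed
    HAVE_CUDA = False


def configure_memory_history(debug: bool = False, max_entries: int = 100_000) -> str:
    """Enable bounded recording only when explicitly debugging."""
    if not HAVE_CUDA:
        return "no-cuda"
    if debug:
        # If profiling is needed, bound the buffer rather than
        # recording with enabled="all" and the default unbounded size.
        torch.cuda.memory._record_memory_history(
            enabled="all", max_entries=max_entries
        )
        return "bounded"
    # The fix for training runs: passing enabled=None turns recording off.
    torch.cuda.memory._record_memory_history(enabled=None)
    return "disabled"
```

The key point is that `_record_memory_history` accumulates its trace in CPU memory for as long as it stays enabled, so any long-running training loop should either disable it or bound it.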

@xiami2019 merged commit ee050e4 into OpenMOSS:main on Mar 4, 2026
